GH-3475: Fix parquet-vector compatibility with Java > 17 (#3476)
iemejia wants to merge 1 commit into apache:master
Conversation
Replace the ByteBuffer-specific vector loads with local helpers that copy the required bytes and then call ByteVector.fromArray. This removes the dependency on JDK-specific ByteVector.fromByteBuffer entry points, which can fail with NoSuchMethodError on newer runtimes. Assisted-by: OpenCode:gpt-5.4
Pull request overview
Updates the vector-encoding plugin to avoid JDK-specific ByteVector.fromByteBuffer(...) entry points (which can break on newer Java runtimes), and expands CI coverage to build/test the vector plugins on more JDK versions.
Changes:
- Replace ByteVector.fromByteBuffer(...) loads with local fromByteBuffer(...) helpers that copy bytes into an array and call ByteVector.fromArray(...).
- Add helper methods to read bytes from ByteBuffer for vector loads (masked and unmasked).
- Expand the vector-plugins GitHub Actions matrix to run on JDK 17/21/25 and conditionally skip Spotless on non-17 JDKs; ignore Eclipse .factorypath.
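The helpers described above must read from the ByteBuffer without disturbing the caller's position. A minimal standalone sketch (hypothetical code, not taken from the PR) of the two common patterns for doing that, assuming a plain heap buffer:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BufferReadPatterns {
    // Copy byteCount bytes starting at inPos via duplicate(): the duplicate
    // shares content but has an independent position/limit, so the original
    // buffer's state is untouched by the relative bulk get.
    static byte[] viaDuplicate(ByteBuffer input, int inPos, int byteCount) {
        byte[] bytes = new byte[byteCount];
        ByteBuffer source = input.duplicate();
        source.position(inPos);
        source.get(bytes);
        return bytes;
    }

    // Same result using absolute gets; no temporary ByteBuffer is allocated
    // and the input's position is never touched at all.
    static byte[] viaAbsoluteGet(ByteBuffer input, int inPos, int byteCount) {
        byte[] bytes = new byte[byteCount];
        for (int i = 0; i < byteCount; i++) {
            bytes[i] = input.get(inPos + i);
        }
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {9, 8, 7, 6, 5});
        System.out.println(Arrays.equals(
                viaDuplicate(buf, 1, 3), viaAbsoluteGet(buf, 1, 3))); // true
        System.out.println(buf.position()); // 0: original buffer unchanged
    }
}
```

Both patterns appear in this PR's discussion: the committed code uses duplicate(), while the review below suggests absolute gets to avoid the extra ByteBuffer allocation.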
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| parquet-plugins/parquet-encoding-vector/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPacking512VectorLE.java | Reworks ByteBuffer-based vector loads to go through array-backed helpers instead of fromByteBuffer. |
| .github/workflows/vector-plugins.yml | Expands CI to additional Java versions and adjusts Spotless skipping behavior. |
| .gitignore | Ignores .factorypath. |
private static ByteVector fromByteBuffer(
    VectorSpecies<Byte> species, ByteBuffer input, int inPos, VectorMask<Byte> mask) {
  return ByteVector.fromArray(species, readInputBytes(input, inPos, mask.trueCount()), 0, mask);
}
In the masked overload, readInputBytes(..., mask.trueCount()) creates a backing array smaller than species.length(). ByteVector.fromArray(species, array, 0, mask) may still perform bounds checks assuming array.length >= species.length() (and also breaks immediately if a mask ever has a true lane at an index >= array.length). To keep masked loads safe across JDK implementations, allocate an array of species.length() and only fill the needed prefix bytes (e.g., mask.trueCount()), leaving the rest as zero.
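The fix the reviewer describes can be sketched without the Vector API itself: allocate the backing array at the full species length and fill only the masked prefix. A minimal illustration, where SPECIES_LENGTH is a hypothetical stand-in for species.length() (64 for a 512-bit byte species):

```java
import java.nio.ByteBuffer;

public class MaskedReadSketch {
    // Stand-in for species.length(); 64 lanes for a 512-bit byte species.
    static final int SPECIES_LENGTH = 64;

    // Allocate the full species length so ByteVector.fromArray's bounds
    // checks are satisfied; fill only the first byteCount bytes (the lanes
    // the mask covers) and leave the remaining lanes zero.
    static byte[] readInputBytes(ByteBuffer input, int inPos, int byteCount) {
        byte[] bytes = new byte[SPECIES_LENGTH];
        for (int i = 0; i < byteCount; i++) {
            bytes[i] = input.get(inPos + i); // absolute get: position untouched
        }
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5, 6, 7, 8});
        byte[] lanes = readInputBytes(buf, 2, 3); // copy 3 bytes from index 2
        System.out.println(lanes.length); // 64
        System.out.println(lanes[0]);     // 3
        System.out.println(lanes[2]);     // 5
        System.out.println(lanes[3]);     // 0 (untouched lane)
    }
}
```

Note this sizing only matches the mask when the true lanes form a contiguous prefix, which holds for the tail-masking pattern used here; a mask with a true lane past the filled prefix would read a zero instead of buffer data.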
private static ByteVector fromByteBuffer(VectorSpecies<Byte> species, ByteBuffer input, int inPos) {
  return ByteVector.fromArray(species, readInputBytes(input, inPos, species.length()), 0);
}

private static ByteVector fromByteBuffer(
    VectorSpecies<Byte> species, ByteBuffer input, int inPos, VectorMask<Byte> mask) {
  return ByteVector.fromArray(species, readInputBytes(input, inPos, mask.trueCount()), 0, mask);
}

private static byte[] readInputBytes(ByteBuffer input, int inPos, int byteCount) {
  byte[] bytes = new byte[byteCount];
  ByteBuffer source = input.duplicate();
  source.position(inPos);
  source.get(bytes);
readInputBytes allocates a new byte[] and creates a new ByteBuffer via duplicate() on every vector unpack from a ByteBuffer. This is a significant per-call allocation/copy regression compared to the previous ByteVector.fromByteBuffer(...) path and can materially impact the performance benefits of the vector plugin. Consider reducing allocations by reusing a scratch buffer (e.g., via ThreadLocal<byte[]> sized to the max species length) and/or fast-pathing heap buffers (input.hasArray()) to load directly from the backing array when possible.
Suggested change:

Before:

private static ByteVector fromByteBuffer(VectorSpecies<Byte> species, ByteBuffer input, int inPos) {
  return ByteVector.fromArray(species, readInputBytes(input, inPos, species.length()), 0);
}

private static ByteVector fromByteBuffer(
    VectorSpecies<Byte> species, ByteBuffer input, int inPos, VectorMask<Byte> mask) {
  return ByteVector.fromArray(species, readInputBytes(input, inPos, mask.trueCount()), 0, mask);
}

private static byte[] readInputBytes(ByteBuffer input, int inPos, int byteCount) {
  byte[] bytes = new byte[byteCount];
  ByteBuffer source = input.duplicate();
  source.position(inPos);
  source.get(bytes);

After:

private static final ThreadLocal<byte[]> INPUT_SCRATCH =
    ThreadLocal.withInitial(() -> new byte[B512.length()]);

private static ByteVector fromByteBuffer(VectorSpecies<Byte> species, ByteBuffer input, int inPos) {
  if (input.hasArray()) {
    return ByteVector.fromArray(species, input.array(), input.arrayOffset() + inPos);
  }
  return ByteVector.fromArray(species, readInputBytes(input, inPos, species.length()), 0);
}

private static ByteVector fromByteBuffer(
    VectorSpecies<Byte> species, ByteBuffer input, int inPos, VectorMask<Byte> mask) {
  if (input.hasArray()) {
    return ByteVector.fromArray(species, input.array(), input.arrayOffset() + inPos, mask);
  }
  return ByteVector.fromArray(species, readInputBytes(input, inPos, mask.trueCount()), 0, mask);
}

private static byte[] readInputBytes(ByteBuffer input, int inPos, int byteCount) {
  byte[] bytes = INPUT_SCRATCH.get();
  for (int i = 0; i < byteCount; i++) {
    bytes[i] = input.get(inPos + i);
  }